Training course: Plotting Data for Communication and Exploration

Dianne Cook
Monash University
Produced for e61, September 23, 2024

Session 3: Plotting data for exploration (mostly)

time  topic
15    Initial data analysis
30    Exploring data
15    Constructing null samples
20    Wrap-up: questions, discussion, other topics

Initial data analysis

Role of initial data analysis

The first thing to do with data is to look at them … usually means tabulating and plotting the data in many different ways to see what’s going on. With the wide availability of computer packages and graphics nowadays there is no excuse for ducking the labour of this preliminary phase, and it may save some red faces later.

Crowder, M. J. & Hand, D. J. (1990) “Analysis of Repeated Measures”

IDA includes:

  • describing the data and collection procedures
  • scrutinising the data for
    • errors
    • outliers/anomalies
    • missing observations
  • checking the assumptions needed for modeling

Exploring missing values

World Development Indicators, 2004-2022 data for selected series.

Rows: 90,972
Columns: 4
$ country_code <chr> "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "AF…
$ series_code  <chr> "EG.CFT.ACCS.ZS", "EG.CFT.ACCS.ZS", "EG.CFT.ACCS.ZS"…
$ year         <dbl> 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012…
$ value        <dbl> 10.5, 11.9, 13.5, 15.1, 16.6, 18.3, 19.9, 21.3, 22.9…

Tidy format variables are:

  • country
  • year
  • 18 series

Because the data are in long form, they can be pivoted in different ways to explore missing values by series, by country, and by year.
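As a minimal sketch of this idea (in Python with only the standard library, on made-up records rather than the actual World Development Indicators), the same long-form table can be summarised along each of its three keys. The country codes, series codes, values, and the helper name `prop_missing` are all hypothetical.

```python
from collections import defaultdict

# Hypothetical long-form records: (country_code, series_code, year, value).
# None marks a missing value; all numbers are made up for illustration.
records = [
    ("AAA", "S1", 2004, 10.5), ("AAA", "S1", 2005, None),
    ("AAA", "S2", 2004, None), ("AAA", "S2", 2005, None),
    ("BBB", "S1", 2004, 3.2),  ("BBB", "S1", 2005, 3.4),
    ("BBB", "S2", 2004, 7.1),  ("BBB", "S2", 2005, None),
]

FIELDS = {"country": 0, "series": 1, "year": 2}

def prop_missing(records, by):
    """Proportion of missing values, grouped by one key of the long form."""
    idx = FIELDS[by]
    total = defaultdict(int)
    missing = defaultdict(int)
    for rec in records:
        total[rec[idx]] += 1
        missing[rec[idx]] += rec[3] is None
    return {key: missing[key] / total[key] for key in total}

by_country = prop_missing(records, "country")
by_series = prop_missing(records, "series")
by_year = prop_missing(records, "year")
print(by_country, by_series, by_year)
```

The same function answers three different questions because the grouping key is just a column of the long table; in wide form each summary would need its own reshaping step.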

The importance of tidy data format (1/2)

Strategy for handling missings:

  1. Remove countries with too many missings
  2. Re-check series and years
  3. Remove series with too many missings
  4. Re-check countries and years
  5. Remove years with too many missings
  6. Re-check all
  7. Impute missings using a temporal imputation
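The steps above can be sketched as follows, in Python with only the standard library. This is an illustration under stated assumptions, not the course's own code: the 0.5 cut-off, the toy records, and the helper names `drop_sparse` and `interpolate` are hypothetical, and linear interpolation is just one possible choice of temporal imputation.

```python
from collections import defaultdict

MAX_MISSING = 0.5  # hypothetical cut-off; the course does not state one

def drop_sparse(records, idx, max_missing=MAX_MISSING):
    """Drop every level of field `idx` whose share of missings exceeds the cut-off."""
    total, missing = defaultdict(int), defaultdict(int)
    for rec in records:
        total[rec[idx]] += 1
        missing[rec[idx]] += rec[3] is None
    keep = {k for k in total if missing[k] / total[k] <= max_missing}
    return [rec for rec in records if rec[idx] in keep]

def interpolate(values):
    """Fill gaps in a time-ordered, equally spaced list by linear interpolation.
    Gaps at either end take the nearest observed value."""
    known = [i for i, v in enumerate(values) if v is not None]
    out = list(values)
    if not known:
        return out  # nothing observed, nothing to interpolate from
    for i, v in enumerate(values):
        if v is not None:
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:
            out[i] = values[right]
        elif right is None:
            out[i] = values[left]
        else:
            w = (i - left) / (right - left)
            out[i] = values[left] + w * (values[right] - values[left])
    return out

# Made-up records: (country, series, year, value).
records = [
    ("AAA", "S1", 2004, None), ("AAA", "S1", 2005, None), ("AAA", "S1", 2006, None),
    ("BBB", "S1", 2004, 1.0),  ("BBB", "S1", 2005, None), ("BBB", "S1", 2006, 3.0),
    ("CCC", "S1", 2004, 2.0),  ("CCC", "S1", 2005, 4.0),  ("CCC", "S1", 2006, None),
]

cleaned = drop_sparse(records, 0)   # steps 1-2: countries, then re-check
cleaned = drop_sparse(cleaned, 1)   # steps 3-4: series, then re-check
cleaned = drop_sparse(cleaned, 2)   # steps 5-6: years, then re-check

# Step 7: impute within each (country, series) along time.
series = defaultdict(list)
for country, code, year, value in sorted(cleaned, key=lambda r: r[:3]):
    series[(country, code)].append(value)
imputed = {key: interpolate(vals) for key, vals in series.items()}
print(imputed)
```

The order matters: removing a sparse country changes the missing proportions of every series and year, which is why each removal is followed by a re-check.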

Read more at R-miss-tastic.

The importance of tidy data format (2/2)

Hmm, what happened?

Illustrations from Julia Lowndes and Allison Horst

Analysis flows more neatly with tidy data format.

Exploratory data analysis

What can you do and what would you expect

OECD PISA data, sample from 2018


Rows: 612,004
Columns: 22
$ year        <fct> 2018, 2018, 2018, 2018, 2018…
$ country     <fct> ALB, ALB, ALB, ALB, ALB, ALB…
$ school_id   <fct> 800002, 800002, 800002, 8000…
$ student_id  <fct> 800251, 800402, 801902, 8035…
$ mother_educ <fct> "ISCED 3A", "ISCED 2", "ISCE…
$ father_educ <fct> "ISCED 3A", "ISCED 2", "ISCE…
$ gender      <fct> male, male, female, male, ma…
$ computer    <fct> yes, yes, no, no, yes, yes, …
$ internet    <fct> yes, yes, no, no, yes, yes, …
$ math        <dbl> 490.187, 462.464, 406.949, 4…
$ read        <dbl> 375.984, 434.352, 359.191, 4…
$ science     <dbl> 445.039, 421.731, 392.223, 5…
$ stu_wgt     <dbl> 13.51452, 13.51452, 9.50669,…
$ desk        <fct> yes, yes, yes, yes, yes, yes…
$ room        <fct> yes, yes, yes, no, yes, yes,…
$ dishwasher  <fct> NA, NA, NA, NA, NA, NA, NA, …
$ television  <fct> 3+, 1, 1, 0, 2, 1, NA, 1, 1,…
$ computer_n  <fct> 1, 1, 0, 0, 1, 1, NA, 0, 1, …
$ car         <fct> 2, 2, 0, NA, 0, NA, NA, 0, 2…
$ book        <fct> 0-10, 11-25, 0-10, 0-10, 11-…
$ wealth      <dbl> -0.0996, -0.7221, -3.6051, -…
$ escs        <dbl> 0.6747, -0.7566, -2.5112, -3…



What would you expect?

Some things you might expect:

  • A gender gap in math scores
  • More books in the home, higher scores

Explore the gap

The math gap is not universal. 😱

There are now many countries where girls score higher on average than boys.

On the other hand, the reading gap is universal. Girls universally score higher than boys on average. 🤯
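A gap like this can be computed per country as a weighted difference in mean scores. The sketch below (Python, standard library only) uses made-up student records and a hypothetical helper `weighted_gap`; it uses the student weight `stu_wgt` in a simple weighted mean and is not a full survey analysis.

```python
from collections import defaultdict

# Hypothetical student records: (country, gender, math, stu_wgt).
students = [
    ("AAA", "female", 500.0, 2.0), ("AAA", "female", 480.0, 1.0),
    ("AAA", "male", 470.0, 1.0),   ("AAA", "male", 490.0, 1.0),
    ("BBB", "female", 450.0, 1.0), ("BBB", "male", 460.0, 3.0),
    ("BBB", "male", 480.0, 1.0),
]

def weighted_gap(students):
    """Weighted mean score of girls minus that of boys, per country."""
    sums = defaultdict(float)     # (country, gender) -> sum of weight * score
    weights = defaultdict(float)  # (country, gender) -> sum of weights
    for country, gender, score, wgt in students:
        sums[(country, gender)] += wgt * score
        weights[(country, gender)] += wgt
    countries = sorted({country for country, _ in sums})
    return {
        c: sums[(c, "female")] / weights[(c, "female")]
           - sums[(c, "male")] / weights[(c, "male")]
        for c in countries
    }

gaps = weighted_gap(students)
print(gaps)  # positive: girls score higher on average; negative: boys do
```

Plotting these differences for every country, sorted by size and with some measure of uncertainty, is what makes it possible to see whether a gap is universal or only typical.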

Scores relative to TVs

Longitudinal wages: overall trend

Log(wages) of 888 individuals, measured at various times in their employment, from the US National Longitudinal Survey of Youth.

Wages tend to increase as time in the workforce gets longer, on average.

The higher the education level achieved, the higher the overall wage, on average.

Eating spaghetti

Consider:

  • sampling individuals
  • longnostics for individuals
  • diagnostics for statistical models

Few individuals experience wages like the overall trend.

Explore individual patterns

Measuring “interesting”

Compute longnostics for each subject, for example,

  • Slope, intercept from simple linear model
  • Variance, standard deviation
  • Jumps, differences
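A minimal sketch of such per-subject summaries, in Python with only the standard library. The records and the function name `longnostics` are hypothetical, and each subject is assumed to have at least two observations at distinct times.

```python
from collections import defaultdict
from statistics import mean, stdev

# Hypothetical records: (subject id, years of experience, log wage).
wages = [
    (1, 0.0, 1.0), (1, 1.0, 1.5), (1, 2.0, 2.0),
    (2, 0.0, 2.0), (2, 1.0, 1.0), (2, 3.0, 2.5),
]

def longnostics(records):
    """Per-subject slope, intercept, standard deviation, and largest jump."""
    by_id = defaultdict(list)
    for sid, x, y in records:
        by_id[sid].append((x, y))
    out = {}
    for sid, pts in by_id.items():
        pts.sort()  # order each subject's observations by time
        xs = [p[0] for p in pts]
        ys = [p[1] for p in pts]
        xbar, ybar = mean(xs), mean(ys)
        # Simple linear regression of y on x, by least squares.
        sxx = sum((x - xbar) ** 2 for x in xs)
        slope = sum((x - xbar) * (y - ybar) for x, y in pts) / sxx
        # Differences between consecutive measurements.
        diffs = [b - a for a, b in zip(ys, ys[1:])]
        out[sid] = {
            "slope": slope,
            "intercept": ybar - slope * xbar,
            "sd": stdev(ys),
            "max_jump": max(diffs, key=abs),
        }
    return out

summary = longnostics(wages)
print(summary)
```

With one row of summaries per subject, individuals can be sorted or filtered, for example to pull out those with the steepest slopes or the largest jumps, and only those profiles plotted.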

Principles to consider

  • Map out what can be computed, plotted
  • Think about what you would expect to find
  • What’s missing? The data collected may not support reliable findings, e.g. Atlas of Living Australia.
  • Compared to what? Some observation samples might be “fixed” by matching on some variables.
  • Is what you see really there?

Adding interaction

Connecting series to tignostics

Read more about tsibbletalk in the package documentation.

Creating null samples to assess the strength of patterns

Does this model fit?



Which plot is most different?


Null sample method: simulate data from a polynomial shape, which tries to model three-point success.
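One way to build such a lineup is sketched below in Python, standard library only. It rests on assumptions not stated in the slides: the polynomial is taken to be quadratic, the errors are taken to be normal with constant variance, and the data, `fit_quadratic`, and `null_lineup` are all hypothetical.

```python
import random

def fit_quadratic(xs, ys):
    """Least-squares fit of y = b0 + b1*x + b2*x^2 via the normal equations."""
    # Augmented 3x4 matrix [X'X | X'y] for the basis 1, x, x^2.
    rows = [[sum(x ** (i + j) for x in xs) for j in range(3)]
            + [sum(y * x ** i for x, y in zip(xs, ys))] for i in range(3)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(rows[r][col]))
        rows[col], rows[pivot] = rows[pivot], rows[col]
        for r in range(3):
            if r != col:
                factor = rows[r][col] / rows[col][col]
                rows[r] = [a - factor * b for a, b in zip(rows[r], rows[col])]
    return [rows[i][3] / rows[i][i] for i in range(3)]

def null_lineup(xs, ys, m=20, seed=1):
    """Hide the observed y among m - 1 samples simulated from the fitted model."""
    rng = random.Random(seed)
    b0, b1, b2 = fit_quadratic(xs, ys)
    fitted = [b0 + b1 * x + b2 * x * x for x in xs]
    resid = [y - f for y, f in zip(ys, fitted)]
    # Residual standard deviation, with 3 estimated coefficients.
    sigma = (sum(r * r for r in resid) / (len(xs) - 3)) ** 0.5
    # Assumption: normal errors with constant variance around the fit.
    panels = [[f + rng.gauss(0, sigma) for f in fitted] for _ in range(m - 1)]
    pos = rng.randrange(m)  # where the observed data are hidden
    panels.insert(pos, list(ys))
    return panels, pos

# Made-up data: a quadratic trend plus small fixed deviations.
xs = [float(x) for x in range(10)]
noise = [0.3, -0.2, 0.1, 0.4, -0.5, 0.2, -0.1, 0.3, -0.4, 0.0]
ys = [1.0 + 2.0 * x + 0.5 * x * x + e for x, e in zip(xs, noise)]
panels, pos = null_lineup(xs, ys)
print(len(panels), pos)
```

Each panel would then be plotted in the same way. If viewers can pick out the panel at `pos`, the observed data differ from what the model generates, which suggests the model does not fit.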

Is there really a relationship?



Which plot is most different?


Null sample method: permute the class variable, which breaks association.
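A permutation lineup can be sketched as follows, in Python with only the standard library. The data and the helpers `mean_diff` and `permutation_lineup` are hypothetical, and the difference in group means is only a numeric stand-in for what a viewer would judge by eye from the plots.

```python
import random
from statistics import mean

def mean_diff(values, labels):
    """Difference in mean value between the two classes."""
    first, second = sorted(set(labels))
    a = [v for v, lab in zip(values, labels) if lab == first]
    b = [v for v, lab in zip(values, labels) if lab == second]
    return mean(a) - mean(b)

def permutation_lineup(labels, m=20, seed=1):
    """Hide the observed labels among m - 1 random permutations of them."""
    rng = random.Random(seed)
    panels = []
    for _ in range(m - 1):
        shuffled = list(labels)
        rng.shuffle(shuffled)  # permuting breaks any link to the values
        panels.append(shuffled)
    pos = rng.randrange(m)     # where the observed labels are hidden
    panels.insert(pos, list(labels))
    return panels, pos

# Made-up data: two classes with clearly different values.
values = [5.1, 4.9, 5.4, 5.0, 6.8, 7.1, 6.9, 7.3]
labels = ["a"] * 4 + ["b"] * 4

panels, pos = permutation_lineup(labels)
stats = [abs(mean_diff(values, panel)) for panel in panels]
print(pos, stats[pos], max(stats))
```

Permuting keeps the values and the class sizes exactly as observed, so any structure seen in a null panel arises by chance alone. If the observed panel stands out from the rest, the relationship is likely to be real.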

Resources

Wrap Up - Questions?

End of session 3

Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.